Navigating multilingual news collections using automatically extracted information
We are presenting a text analysis tool set that allows analysts in various
fields to sieve through large collections of multilingual news items quickly
and to find information that is of relevance to them. For a given document
collection, the tool set automatically clusters the texts into groups of
similar articles, extracts names of places, people and organisations, lists the
user-defined specialist terms found, links clusters and entities, and generates
hyperlinks. Through its daily news analysis operating on thousands of articles
per day, the tool also learns relationships between people and other entities.
The fully functional prototype system allows users to explore and navigate
multilingual document collections across languages and time.
Comment: This paper describes the main functionality of the JRC's
fully automatic news analysis system NewsExplorer, which is currently freely
accessible in thirteen languages at http://press.jrc.it/NewsExplorer/ . 8
pages
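The clustering step mentioned in this abstract can be illustrated with a small sketch. The abstract does not say which clustering method NewsExplorer uses, so the following is only an assumption for illustration: articles are turned into bag-of-words vectors and greedily grouped by cosine similarity. The function names and the 0.5 threshold are hypothetical.

```python
import math
from collections import Counter


def vectorise(text):
    """Bag-of-words vector: lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(texts, threshold=0.5):
    """Greedy single-pass clustering (illustrative only).

    Each text joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `texts`.
    """
    vectors = [vectorise(t) for t in texts]
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


articles = [
    "earthquake hits city centre",
    "earthquake hits city suburbs",
    "bank raises interest rates",
]
print(cluster(articles))
```

A production system would more plausibly use weighted terms and a proper clustering algorithm; the sketch only shows the idea of grouping similar articles.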
A tool set for the quick and efficient exploration of large document collections
We are presenting a set of multilingual text analysis tools that can help
analysts in any field to explore large document collections quickly in order to
determine whether the documents contain information of interest, and to find
the relevant text passages. The automatic tool, which currently exists as a
fully functional prototype, is expected to be particularly useful when users
repeatedly have to sieve through large collections of documents such as those
downloaded automatically from the internet. The proposed system takes a whole
document collection as input. It first carries out some automatic analysis
tasks (named entity recognition, geo-coding, clustering, term extraction),
annotates the texts with the generated meta-information and stores the
meta-information in a database. The system then generates a zoomable and
hyperlinked geographic map enhanced with information on entities and terms
found. When the system is used on a regular basis, it builds up a historical
database that contains information on which names have been mentioned together
with which other names or places, and users can query this database to retrieve
information extracted in the past.
Comment: 10 pages
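The historical database of names mentioned together can be pictured with a short sketch. This is not the system's actual storage model, which the abstract does not describe; it merely assumes that each document has already been reduced to a list of recognised entity names, and it keeps the counts in memory. The names `cooccurrence_counts` and `mentioned_with` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how many documents mention each unordered pair of names.

    `documents` is an iterable of lists of entity names, one list per document.
    """
    counts = Counter()
    for entities in documents:
        # Sort the unique names so that each unordered pair has a single key.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


def mentioned_with(counts, name):
    """Return (other_name, count) pairs for `name`, most frequent first."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == name:
            related[b] += n
        elif b == name:
            related[a] += n
    return related.most_common()


docs = [
    ["Paris", "Alice", "Bob"],
    ["Alice", "Bob"],
    ["Alice", "Paris"],
]
counts = cooccurrence_counts(docs)
print(mentioned_with(counts, "Bob"))
```

Querying the counts for one name returns the people and places it has appeared with, which is the kind of lookup the abstract describes for information extracted in the past.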
The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages
We present a new, unique and freely available parallel corpus containing
European Union (EU) documents of mostly legal nature. It is available in all 20
official EU languages, with additional documents being available in the languages
of the EU candidate countries. The corpus consists of almost 8,000 documents
per language, with an average size of nearly 9 million words per language.
Pair-wise paragraph alignment information produced by two different aligners
(Vanilla and HunAlign) is available for all 190+ language pair combinations.
Most texts have been manually classified according to the EUROVOC subject
domains so that the collection can also be used to train and test multi-label
classification algorithms and keyword-assignment software. The corpus is
encoded in XML, according to the Text Encoding Initiative Guidelines. Due to
the large number of parallel texts in many languages, the JRC-Acquis is
particularly suitable for all types of cross-language research, as well as for
testing and benchmarking text analysis software across different languages
(for instance for alignment, sentence splitting and term extraction).
Comment: A multilingual textual resource with meta-data freely available for
download at http://langtech.jrc.it/JRC-Acquis.htm
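The figure of 190 language pairs follows from the 20 official languages: the number of unordered pairs among n languages is n(n-1)/2, and 20 x 19 / 2 = 190. Each additional candidate-country language adds further pairs, which is presumably why the abstract writes "190+". The short check below uses placeholder language codes (`lang01` to `lang20`), which are assumptions and not the corpus's real identifiers.

```python
from itertools import combinations
from math import comb


def language_pairs(codes):
    """All unordered pairs of distinct language codes."""
    return list(combinations(sorted(codes), 2))


# Placeholder codes standing in for the 20 official EU languages.
codes = [f"lang{i:02d}" for i in range(1, 21)]
pairs = language_pairs(codes)

print(len(pairs))    # number of pair-wise alignments for 20 languages
print(comb(22, 2))   # pairs if two further languages were added
```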
Multilingual person name recognition and transliteration
We present a tool for detecting person names in articles from the international press that can recognise the different variants of the same name. The originality of our approach lies in identifying name variants across languages and writing systems, including Greek, Cyrillic and Arabic. Given our multilingual context, we use a standard internal representation of each name together with a single similarity measure (instead of adopting the usual bilingual approach to transliteration). This module is part of a more general tool that analyses an average of 15,000 newspaper articles every day in order to group similar documents, both within one language and across different languages.
We present an exploratory tool that extracts person names from multilingual news collections, matches name variants referring to the same person, and infers relationships between people based on the co-occurrence of their names in related news. A novel feature is the matching of name variants across languages and writing systems, including names written with the Greek, Cyrillic and Arabic writing systems. Due to our highly multilingual setting, we use a standard internal representation for name representation and matching, instead of adopting the traditional bilingual approach to transliteration. This work is part of a news analysis system that clusters an average of 25,000 news articles per day to detect related news within the same and across different languages.
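The idea of mapping every name to one internal representation and comparing names with a single similarity measure can be sketched as follows. This is not the authors' algorithm: the normalisation rules, the toy Cyrillic table (`CYRILLIC_TO_LATIN`, covering only a handful of letters), the use of `difflib.SequenceMatcher` as the similarity measure and the 0.8 threshold are all assumptions chosen for illustration.

```python
import unicodedata
from difflib import SequenceMatcher

# Toy transliteration table (hypothetical, deliberately incomplete).
CYRILLIC_TO_LATIN = {
    "в": "v", "л": "l", "а": "a", "д": "d", "и": "i", "м": "m",
    "р": "r", "п": "p", "у": "u", "т": "t", "н": "n",
}


def normalise(name):
    """Map a name to a simple script-independent internal form."""
    name = name.lower()
    # Transliterate the few Cyrillic letters covered by the toy table.
    name = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in name)
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse runs of whitespace.
    return " ".join(stripped.split())


def similarity(a, b):
    """Similarity in [0, 1] between the normalised forms of two names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def same_person(a, b, threshold=0.8):
    """Treat two names as variants if their similarity reaches the threshold."""
    return similarity(a, b) >= threshold


print(normalise("Владимир Путин"))
print(same_person("Vladimir Putin", "Wladimir Putin"))
print(same_person("Vladimir Putin", "Angela Merkel"))
```

Because every variant is compared in the same normalised form, one measure serves all language pairs, which avoids writing separate transliteration rules for each pair of languages.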